India Rainfall Analysis¶
Motivation and Description¶
Monsoon prediction is clearly of great importance for India.Two types of rainfall predictions can be done, They are
- Long term predictions: Predict rainfall over few weeks/months in advance.
- Short term predictions: Predict rainfall a few days in advance in specific locations.
Indian meteorological department provides forecasting data required for project. In this project we are planning to work on long term predictions of rainfall. The main motive of the project is to predict the amount of rainfall in a particular division or state well in advance. We predict the amount of rainfall using past data.
Dataset¶
- Dataset1(dataset1) This dataset has average rainfall from 1951-2000 for each district, for every month.
- Dataset2(dataset2) This dataset has average rainfall for every year from 1901-2005 for each state.
Methodology¶
- Converting data in to the correct format to conduct experiments.
- Make a good analysis of data and observe variation in the patterns of rainfall.
- Finally, we try to predict the average rainfall by separating data into training and testing. We apply various statistical and machine learning approaches(SVM, etc) in prediction and make analysis over various approaches. By using various approaches we try to minimize the error.
In [59]:
import numpy as np # linear algebra #natrix
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [60]:
# import plotly.express as px
# df = px.data.iris()
# fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", marginal_y="violin",
# marginal_x="box", trendline="ols", template="simple_white")
# fig.show()
Types of graphs¶
- Bar graphs showing distribution of amount of rainfall.
- Distribution of amount of rainfall yearly, monthly, groups of months.
- Distribution of rainfall in subdivisions, districts form each month, groups of months.
- Heat maps showing correlation between amount of rainfall between months.
In [61]:
data = pd.read_csv(r"H:/4th Year/Sem 8/MaP2/rainfall-prediction-master/data/rainfall_in_india_1901-2015.csv",sep=",")
# data = data.fillna(data.mean())
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4116 entries, 0 to 4115 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SUBDIVISION 4116 non-null object 1 YEAR 4116 non-null int64 2 JAN 4112 non-null float64 3 FEB 4113 non-null float64 4 MAR 4110 non-null float64 5 APR 4112 non-null float64 6 MAY 4113 non-null float64 7 JUN 4111 non-null float64 8 JUL 4109 non-null float64 9 AUG 4112 non-null float64 10 SEP 4110 non-null float64 11 OCT 4109 non-null float64 12 NOV 4105 non-null float64 13 DEC 4106 non-null float64 14 ANNUAL 4090 non-null float64 15 Jan-Feb 4110 non-null float64 16 Mar-May 4107 non-null float64 17 Jun-Sep 4106 non-null float64 18 Oct-Dec 4103 non-null float64 dtypes: float64(17), int64(1), object(1) memory usage: 611.1+ KB
Dataset-1 Description¶
- Data has 36 sub divisions and 19 attributes (individual months, annual, combinations of 3 consecutive months).
- For some of the subdivisions data is from 1950 to 2005.
- All the attributes has the sum of amount of rainfall in mm.
In [62]:
data = data.fillna(data.mean(numeric_only = True))
In [63]:
data.head()
Out[63]:
| SUBDIVISION | YEAR | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL | Jan-Feb | Mar-May | Jun-Sep | Oct-Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANDAMAN & NICOBAR ISLANDS | 1901 | 49.2 | 87.1 | 29.2 | 2.3 | 528.8 | 517.5 | 365.1 | 481.1 | 332.6 | 388.5 | 558.2 | 33.6 | 3373.2 | 136.3 | 560.3 | 1696.3 | 980.3 |
| 1 | ANDAMAN & NICOBAR ISLANDS | 1902 | 0.0 | 159.8 | 12.2 | 0.0 | 446.1 | 537.1 | 228.9 | 753.7 | 666.2 | 197.2 | 359.0 | 160.5 | 3520.7 | 159.8 | 458.3 | 2185.9 | 716.7 |
| 2 | ANDAMAN & NICOBAR ISLANDS | 1903 | 12.7 | 144.0 | 0.0 | 1.0 | 235.1 | 479.9 | 728.4 | 326.7 | 339.0 | 181.2 | 284.4 | 225.0 | 2957.4 | 156.7 | 236.1 | 1874.0 | 690.6 |
| 3 | ANDAMAN & NICOBAR ISLANDS | 1904 | 9.4 | 14.7 | 0.0 | 202.4 | 304.5 | 495.1 | 502.0 | 160.1 | 820.4 | 222.2 | 308.7 | 40.1 | 3079.6 | 24.1 | 506.9 | 1977.6 | 571.0 |
| 4 | ANDAMAN & NICOBAR ISLANDS | 1905 | 1.3 | 0.0 | 3.3 | 26.9 | 279.5 | 628.7 | 368.7 | 330.5 | 297.0 | 260.7 | 25.4 | 344.7 | 2566.7 | 1.3 | 309.7 | 1624.9 | 630.8 |
In [64]:
data.describe()
Out[64]:
| YEAR | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL | Jan-Feb | Mar-May | Jun-Sep | Oct-Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 | 4116.000000 |
| mean | 1958.218659 | 18.957320 | 21.805325 | 27.359197 | 43.127432 | 85.745417 | 230.234444 | 347.214334 | 290.263497 | 197.361922 | 95.507009 | 39.866163 | 18.870580 | 1411.008900 | 40.747786 | 155.901753 | 1064.724769 | 154.100487 |
| std | 33.140898 | 33.569044 | 35.896396 | 46.925176 | 67.798192 | 123.189974 | 234.568120 | 269.310313 | 188.678707 | 135.309591 | 99.434452 | 68.593545 | 42.318098 | 900.986632 | 59.265023 | 201.096692 | 706.881054 | 166.678751 |
| min | 1901.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.400000 | 0.000000 | 0.000000 | 0.100000 | 0.000000 | 0.000000 | 0.000000 | 62.300000 | 0.000000 | 0.000000 | 57.400000 | 0.000000 |
| 25% | 1930.000000 | 0.600000 | 0.600000 | 1.000000 | 3.000000 | 8.600000 | 70.475000 | 175.900000 | 156.150000 | 100.600000 | 14.600000 | 0.700000 | 0.100000 | 806.450000 | 4.100000 | 24.200000 | 574.375000 | 34.200000 |
| 50% | 1958.000000 | 6.000000 | 6.700000 | 7.900000 | 15.700000 | 36.700000 | 138.900000 | 284.900000 | 259.500000 | 174.100000 | 65.750000 | 9.700000 | 3.100000 | 1125.450000 | 19.300000 | 75.200000 | 882.250000 | 98.800000 |
| 75% | 1987.000000 | 22.125000 | 26.800000 | 31.225000 | 49.825000 | 96.825000 | 304.950000 | 418.225000 | 377.725000 | 265.725000 | 148.300000 | 45.825000 | 17.700000 | 1635.100000 | 50.300000 | 196.900000 | 1287.550000 | 212.600000 |
| max | 2015.000000 | 583.700000 | 403.500000 | 605.600000 | 595.100000 | 1168.600000 | 1609.900000 | 2362.800000 | 1664.600000 | 1222.000000 | 948.300000 | 648.900000 | 617.500000 | 6331.100000 | 699.500000 | 1745.800000 | 4536.900000 | 1252.500000 |
In [65]:
data.hist(figsize=(24,24));
Observations¶
- Above histograms show the distribution of rainfall over months.
- Observed increase in amount of rainfall over months July, August, September.
In [66]:
data.groupby("YEAR").sum()['ANNUAL'].plot(figsize=(12,8));
Observations¶
- Shows distribution of rainfall over years.
- Observed high amount of rainfall in 1950s.
In [67]:
data[['YEAR', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].groupby("YEAR").sum().plot(figsize=(13,8));
In [68]:
px.line(data[['YEAR','Jan-Feb', 'Mar-May',
'Jun-Sep', 'Oct-Dec']])
In [69]:
data[['YEAR','Jan-Feb', 'Mar-May',
'Jun-Sep', 'Oct-Dec']].groupby("YEAR").sum().plot(figsize=(13,8));
Observations¶
- The above two graphs show the distribution of rainfall over months.
- The graphs clearly shows that amount of rainfall in high in the months july, aug, sep which is monsoon season in India.
In [70]:
data[['SUBDIVISION', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].groupby("SUBDIVISION").mean().plot.barh(stacked=True,figsize=(13,8));
In [71]:
data[['SUBDIVISION', 'Jan-Feb', 'Mar-May',
'Jun-Sep', 'Oct-Dec']].groupby("SUBDIVISION").sum().plot.barh(stacked=True,figsize=(16,8));
In [72]:
px.box(data[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']])
Observations¶
- Above two graphs shows that the amount of rainfall is reasonably good in the months of march, april, may in eastern India.
In [73]:
plt.figure(figsize=(11,4))
sns.heatmap(data[['Jan-Feb','Mar-May','Jun-Sep','Oct-Dec','ANNUAL']].corr(),annot=True)
plt.show()
In [74]:
plt.figure(figsize=(11,4))
sns.heatmap(data[['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC','ANNUAL']].corr(),annot=True)
plt.show()
In [75]:
px.scatter(data[['Jan-Feb','Mar-May','Jun-Sep','Oct-Dec','ANNUAL']])
Observations¶
- Heat Map shows the co-relation(dependency) betwenn the amounts of rainfall over months.
- From above it is clear that if amount of rainfall is high in the months of july, august, september then the amount of rainfall will be high annually.
- It is also obwserved that if amount of rainfall in good in the months of october, november, december then the rainfall is going to b good in the overall year.
In [76]:
#Function to plot the graphs
def plot_graphs(groundtruth,prediction,title):
N = 9
ind = np.arange(N) # the x locations for the groups
width = 0.27 # the width of the bars
fig = plt.figure()
fig.suptitle(title, fontsize=12)
ax = fig.add_subplot(111)
rects1 = ax.bar(ind, groundtruth, width, color='b')
rects2 = ax.bar(ind+width, prediction, width, color='g')
ax.set_xlabel("Month of the Year")
ax.set_ylabel("Amount of rainfall")
ax.set_xticks(ind+width)
ax.set_xticklabels( ('APR', 'MAY', 'JUN', 'JUL','AUG', 'SEP', 'OCT', 'NOV', 'DEC') )
ax.legend( (rects1[0], rects2[0]), ('Ground truth', 'Prediction') )
# autolabel(rects1)
for rect in rects1:
h = rect.get_height()
ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
ha='center', va='bottom')
for rect in rects2:
h = rect.get_height()
ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
ha='center', va='bottom')
# autolabel(rects2)
plt.show()
Predictions¶
- For prediction we formatted data in the way, given the rainfall in the last three months we try to predict the rainfall in the next consecutive month.
- For all the experiments we used 80:20 training and test ratio.
- Linear regression
- SVR
- Artificial neural nets
- Tersting metrics: We used Mean absolute error to train the models.
- We also shown the amount of rainfall actually and predicted with the histogram plots.
- We did two types of trainings once training on complete dataset and other with training with only telangana data
- All means are standard deviation observations are written, first one represents ground truth, second one represents predictions.
In [77]:
# seperation of training and testing data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
division_data = np.asarray(data[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']])
X = None; y = None
for i in range(division_data.shape[1]-3):
if X is None:
X = division_data[:, i:i+3]
y = division_data[:, i+3]
else:
X = np.concatenate((X, division_data[:, i:i+3]), axis=0)
y = np.concatenate((y, division_data[:, i+3]), axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
In [78]:
#test 2010
temp = data[['SUBDIVISION','JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[data['YEAR'] == 2010]
data_2010 = np.asarray(temp[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[temp['SUBDIVISION'] == 'TELANGANA'])
X_year_2010 = None; y_year_2010 = None
for i in range(data_2010.shape[1]-3):
if X_year_2010 is None:
X_year_2010 = data_2010[:, i:i+3]
y_year_2010 = data_2010[:, i+3]
else:
X_year_2010 = np.concatenate((X_year_2010, data_2010[:, i:i+3]), axis=0)
y_year_2010 = np.concatenate((y_year_2010, data_2010[:, i+3]), axis=0)
In [79]:
#test 2005
temp = data[['SUBDIVISION','JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[data['YEAR'] == 2005]
data_2005 = np.asarray(temp[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[temp['SUBDIVISION'] == 'TELANGANA'])
X_year_2005 = None; y_year_2005 = None
for i in range(data_2005.shape[1]-3):
if X_year_2005 is None:
X_year_2005 = data_2005[:, i:i+3]
y_year_2005 = data_2005[:, i+3]
else:
X_year_2005 = np.concatenate((X_year_2005, data_2005[:, i:i+3]), axis=0)
y_year_2005 = np.concatenate((y_year_2005, data_2005[:, i+3]), axis=0)
In [80]:
#terst 2005
temp = data[['SUBDIVISION','JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[data['YEAR'] == 2015]
data_2015 = np.asarray(temp[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[temp['SUBDIVISION'] == 'TELANGANA'])
X_year_2015 = None; y_year_2015 = None
for i in range(data_2015.shape[1]-3):
if X_year_2015 is None:
X_year_2015 = data_2015[:, i:i+3]
y_year_2015 = data_2015[:, i+3]
else:
X_year_2015 = np.concatenate((X_year_2015, data_2015[:, i:i+3]), axis=0)
y_year_2015 = np.concatenate((y_year_2015, data_2015[:, i+3]), axis=0)
In [81]:
from sklearn import linear_model
# linear model
reg = linear_model.ElasticNet(alpha=0.5)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
96.32435229744083
In [82]:
#2005
y_year_pred_2005 = reg.predict(X_year_2005)
#2010
y_year_pred_2010 = reg.predict(X_year_2010)
y_year_pred_2015 = reg.predict(X_year_2015)
print ("MEAN 2005")
print (np.mean(y_year_2005),np.mean(y_year_pred_2005))
print ("Standard deviation 2005")
print (np.sqrt(np.var(y_year_2005)),np.sqrt(np.var(y_year_pred_2005)))
print ("MEAN 2010")
print (np.mean(y_year_2010),np.mean(y_year_pred_2010))
print ("Standard deviation 2010")
print (np.sqrt(np.var(y_year_2010)),np.sqrt(np.var(y_year_pred_2010)))
print ("MEAN 2015")
print (np.mean(y_year_2015),np.mean(y_year_pred_2015))
print ("Standard deviation 2015")
print (np.sqrt(np.var(y_year_2015)),np.sqrt(np.var(y_year_pred_2015)))
plot_graphs(y_year_2005,y_year_pred_2005,"Year-2005")
plot_graphs(y_year_2010,y_year_pred_2010,"Year-2010")
plot_graphs(y_year_2015,y_year_pred_2015,"Year-2015")
# px.bar(y_year_2015,y_year_pred_2015)
MEAN 2005 121.2111111111111 134.68699821349804 Standard deviation 2005 123.77066107608005 90.86310230416439 MEAN 2010 139.93333333333334 144.80501326515912 Standard deviation 2010 135.71320250194282 95.94931363601724 MEAN 2015 88.52222222222223 119.64752006738831 Standard deviation 2015 86.62446123324875 62.36355370163372
In [83]:
from sklearn.svm import SVR
# SVM model
clf = SVR(gamma='auto', C=0.1, epsilon=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
127.1600615632603
In [84]:
#2005
y_year_pred_2005 = reg.predict(X_year_2005)
#2010
y_year_pred_2010 = reg.predict(X_year_2010)
#2015
y_year_pred_2015 = reg.predict(X_year_2015)
print ("MEAN 2005")
print (np.mean(y_year_2005),np.mean(y_year_pred_2005))
print ("Standard deviation 2005")
print (np.sqrt(np.var(y_year_2005)),np.sqrt(np.var(y_year_pred_2005)))
print ("MEAN 2010")
print (np.mean(y_year_2010),np.mean(y_year_pred_2010))
print ("Standard deviation 2010")
print (np.sqrt(np.var(y_year_2010)),np.sqrt(np.var(y_year_pred_2010)))
print ("MEAN 2015")
print (np.mean(y_year_2015),np.mean(y_year_pred_2015))
print ("Standard deviation 2015")
print (np.sqrt(np.var(y_year_2015)),np.sqrt(np.var(y_year_pred_2015)))
plot_graphs(y_year_2005,y_year_pred_2005,"Year-2005")
plot_graphs(y_year_2010,y_year_pred_2010,"Year-2010")
plot_graphs(y_year_2015,y_year_pred_2015,"Year-2015")
MEAN 2005 121.2111111111111 134.68699821349804 Standard deviation 2005 123.77066107608005 90.86310230416439 MEAN 2010 139.93333333333334 144.80501326515912 Standard deviation 2010 135.71320250194282 95.94931363601724 MEAN 2015 88.52222222222223 119.64752006738831 Standard deviation 2015 86.62446123324875 62.36355370163372
In [85]:
from keras.models import Model
from keras.layers import Dense, Input, Conv1D, Flatten
# NN model
inputs = Input(shape=(3,1))
x = Conv1D(64, 2, padding='same', activation='elu')(inputs)
x = Conv1D(128, 2, padding='same', activation='elu')(x)
x = Flatten()(x)
x = Dense(128, activation='elu')(x)
x = Dense(64, activation='elu')(x)
x = Dense(32, activation='elu')(x)
x = Dense(1, activation='linear')(x)
model = Model(inputs=[inputs], outputs=[x])
model.compile(loss='mean_squared_error', optimizer='adamax', metrics=['mae'])
model.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 3, 1)] 0
conv1d (Conv1D) (None, 3, 64) 192
conv1d_1 (Conv1D) (None, 3, 128) 16512
flatten (Flatten) (None, 384) 0
dense (Dense) (None, 128) 49280
dense_1 (Dense) (None, 64) 8256
dense_2 (Dense) (None, 32) 2080
dense_3 (Dense) (None, 1) 33
=================================================================
Total params: 76353 (298.25 KB)
Trainable params: 76353 (298.25 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [86]:
model.fit(x=np.expand_dims(X_train, axis=2), y=y_train, batch_size=64, epochs=30, verbose=1, validation_split=0.1, shuffle=True)
y_pred = model.predict(np.expand_dims(X_test, axis=2))
print (mean_absolute_error(y_test, y_pred))
Epoch 1/30 469/469 [==============================] - 4s 6ms/step - loss: 20048.7051 - mae: 88.0432 - val_loss: 17790.2695 - val_mae: 87.2101 Epoch 2/30 469/469 [==============================] - 3s 6ms/step - loss: 18590.6836 - mae: 86.5963 - val_loss: 17492.6152 - val_mae: 86.1020 Epoch 3/30 469/469 [==============================] - 2s 5ms/step - loss: 18464.1328 - mae: 86.4011 - val_loss: 17713.4082 - val_mae: 88.1468 Epoch 4/30 469/469 [==============================] - 2s 5ms/step - loss: 18494.6621 - mae: 86.2514 - val_loss: 17419.8516 - val_mae: 85.9579 Epoch 5/30 469/469 [==============================] - 3s 6ms/step - loss: 18364.5234 - mae: 86.1242 - val_loss: 18251.0723 - val_mae: 82.8346 Epoch 6/30 469/469 [==============================] - 2s 5ms/step - loss: 18358.2754 - mae: 86.1360 - val_loss: 17427.7363 - val_mae: 85.9941 Epoch 7/30 469/469 [==============================] - 2s 5ms/step - loss: 18261.5645 - mae: 85.8212 - val_loss: 17553.6562 - val_mae: 86.5898 Epoch 8/30 469/469 [==============================] - 2s 5ms/step - loss: 18235.7559 - mae: 85.6793 - val_loss: 17132.9551 - val_mae: 83.5857 Epoch 9/30 469/469 [==============================] - 2s 5ms/step - loss: 18130.5762 - mae: 85.5270 - val_loss: 17429.0723 - val_mae: 85.7730 Epoch 10/30 469/469 [==============================] - 2s 5ms/step - loss: 18151.4941 - mae: 85.4313 - val_loss: 17761.3945 - val_mae: 88.6346 Epoch 11/30 469/469 [==============================] - 2s 5ms/step - loss: 18123.6719 - mae: 85.6369 - val_loss: 17250.7324 - val_mae: 83.2518 Epoch 12/30 469/469 [==============================] - 3s 5ms/step - loss: 18135.6582 - mae: 85.4035 - val_loss: 17464.4434 - val_mae: 87.3533 Epoch 13/30 469/469 [==============================] - 3s 6ms/step - loss: 18025.4121 - mae: 85.2521 - val_loss: 17350.4336 - val_mae: 84.1670 Epoch 14/30 469/469 [==============================] - 3s 6ms/step - loss: 18068.4121 - mae: 85.1873 - val_loss: 17049.9883 - val_mae: 84.7489 Epoch 15/30 469/469 [==============================] - 3s 5ms/step - loss: 17976.3203 - mae: 85.0841 - val_loss: 17243.8965 - val_mae: 84.4475 Epoch 16/30 469/469 [==============================] - 2s 5ms/step - loss: 17980.6465 - mae: 85.0565 - val_loss: 16952.8125 - val_mae: 83.5256 Epoch 17/30 469/469 [==============================] - 3s 6ms/step - loss: 17926.0586 - mae: 84.8479 - val_loss: 17217.4863 - val_mae: 84.8391 Epoch 18/30 469/469 [==============================] - 3s 6ms/step - loss: 17923.0098 - mae: 84.8559 - val_loss: 17166.1367 - val_mae: 85.9009 Epoch 19/30 469/469 [==============================] - 3s 6ms/step - loss: 17890.9004 - mae: 84.7324 - val_loss: 16922.4199 - val_mae: 83.7106 Epoch 20/30 469/469 [==============================] - 3s 6ms/step - loss: 17807.5254 - mae: 84.5616 - val_loss: 17023.0020 - val_mae: 82.3268 Epoch 21/30 469/469 [==============================] - 3s 5ms/step - loss: 17755.3887 - mae: 84.5581 - val_loss: 17115.4102 - val_mae: 83.5145 Epoch 22/30 469/469 [==============================] - 3s 5ms/step - loss: 17743.2715 - mae: 84.4051 - val_loss: 17064.6113 - val_mae: 84.9584 Epoch 23/30 469/469 [==============================] - 3s 5ms/step - loss: 17670.2461 - mae: 84.4333 - val_loss: 17600.5391 - val_mae: 85.8445 Epoch 24/30 469/469 [==============================] - 3s 5ms/step - loss: 17675.9238 - mae: 84.2927 - val_loss: 16890.9570 - val_mae: 82.7249 Epoch 25/30 469/469 [==============================] - 2s 5ms/step - loss: 17621.6992 - mae: 84.0671 - val_loss: 17052.2383 - val_mae: 84.5972 Epoch 26/30 469/469 [==============================] - 2s 5ms/step - loss: 17579.6895 - mae: 84.0536 - val_loss: 16896.7773 - val_mae: 83.1722 Epoch 27/30 469/469 [==============================] - 2s 5ms/step - loss: 17561.6934 - mae: 83.9348 - val_loss: 16902.0176 - val_mae: 84.7980 Epoch 28/30 469/469 [==============================] - 3s 5ms/step - loss: 17521.4219 - mae: 83.9597 - val_loss: 16993.0352 - val_mae: 83.8641 Epoch 29/30 469/469 [==============================] - 2s 5ms/step - loss: 17487.1641 - mae: 83.8764 - val_loss: 17374.5312 - val_mae: 85.5506 Epoch 30/30 469/469 [==============================] - 2s 5ms/step - loss: 17429.5371 - mae: 83.7223 - val_loss: 17012.5020 - val_mae: 82.2100 116/116 [==============================] - 0s 3ms/step 84.45708244925238
In [87]:
#2005
y_year_pred_2005 = reg.predict(X_year_2005)
#2010
y_year_pred_2010 = reg.predict(X_year_2010)
#2015
y_year_pred_2015 = reg.predict(X_year_2015)
print ("MEAN 2005")
print (np.mean(y_year_2005),np.mean(y_year_pred_2005))
print ("Standard deviation 2005")
print (np.sqrt(np.var(y_year_2005)),np.sqrt(np.var(y_year_pred_2005)))
print ("MEAN 2010")
print (np.mean(y_year_2010),np.mean(y_year_pred_2010))
print ("Standard deviation 2010")
print (np.sqrt(np.var(y_year_2010)),np.sqrt(np.var(y_year_pred_2010)))
print ("MEAN 2015")
print (np.mean(y_year_2015),np.mean(y_year_pred_2015))
print ("Standard deviation 2015")
print (np.sqrt(np.var(y_year_2015)),np.sqrt(np.var(y_year_pred_2015)))
plot_graphs(y_year_2005,y_year_pred_2005,"Year-2005")
plot_graphs(y_year_2010,y_year_pred_2010,"Year-2010")
plot_graphs(y_year_2015,y_year_pred_2015,"Year-2015")
MEAN 2005 121.2111111111111 134.68699821349804 Standard deviation 2005 123.77066107608005 90.86310230416439 MEAN 2010 139.93333333333334 144.80501326515912 Standard deviation 2010 135.71320250194282 95.94931363601724 MEAN 2015 88.52222222222223 119.64752006738831 Standard deviation 2015 86.62446123324875 62.36355370163372
In [88]:
# spliting training and testing data only for telangana
telangana = np.asarray(data[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[data['SUBDIVISION'] == 'TELANGANA'])
X = None; y = None
for i in range(telangana.shape[1]-3):
if X is None:
X = telangana[:, i:i+3]
y = telangana[:, i+3]
else:
X = np.concatenate((X, telangana[:, i:i+3]), axis=0)
y = np.concatenate((y, telangana[:, i+3]), axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
In [89]:
from sklearn import linear_model
# linear model
reg = linear_model.ElasticNet(alpha=0.5)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
64.72601914484643
In [90]:
#2005
y_year_pred_2005 = reg.predict(X_year_2005)
#2010
y_year_pred_2010 = reg.predict(X_year_2010)
#2015
y_year_pred_2015 = reg.predict(X_year_2015)
print ("MEAN 2005")
print (np.mean(y_year_2005),np.mean(y_year_pred_2005))
print ("Standard deviation 2005")
print (np.sqrt(np.var(y_year_2005)),np.sqrt(np.var(y_year_pred_2005)))
print ("MEAN 2010")
print (np.mean(y_year_2010),np.mean(y_year_pred_2010))
print ("Standard deviation 2010")
print (np.sqrt(np.var(y_year_2010)),np.sqrt(np.var(y_year_pred_2010)))
print ("MEAN 2015")
print (np.mean(y_year_2015),np.mean(y_year_pred_2015))
print ("Standard deviation 2015")
print (np.sqrt(np.var(y_year_2015)),np.sqrt(np.var(y_year_pred_2015)))
plot_graphs(y_year_2005,y_year_pred_2005,"Year-2005")
plot_graphs(y_year_2010,y_year_pred_2010,"Year-2010")
plot_graphs(y_year_2015,y_year_pred_2015,"Year-2015")
# sns.scatterplot(data=y_year_2015,x = "month" , y ="rainfall(mm)" , hue = "YEAR")
MEAN 2005 121.2111111111111 106.49798150231581 Standard deviation 2005 123.77066107608005 76.08558540019236 MEAN 2010 139.93333333333334 112.18662987131034 Standard deviation 2010 135.71320250194282 84.35813629737333 MEAN 2015 88.52222222222223 96.76817006572782 Standard deviation 2015 86.62446123324875 52.45304841713268
In [91]:
from sklearn.svm import SVR
# SVM model
clf = SVR(kernel='rbf', gamma='auto', C=0.5, epsilon=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
115.32415990638656
In [92]:
#2005
y_year_pred_2005 = reg.predict(X_year_2005)
#2010
y_year_pred_2010 = reg.predict(X_year_2010)
#2015
y_year_pred_2015 = reg.predict(X_year_2015)
print ("MEAN 2005")
print (np.mean(y_year_2005),np.mean(y_year_pred_2005))
print ("Standard deviation 2005")
print (np.sqrt(np.var(y_year_2005)),np.sqrt(np.var(y_year_pred_2005)))
print ("MEAN 2010")
print (np.mean(y_year_2010),np.mean(y_year_pred_2010))
print ("Standard deviation 2010")
print (np.sqrt(np.var(y_year_2010)),np.sqrt(np.var(y_year_pred_2010)))
print ("MEAN 2015")
print (np.mean(y_year_2015),np.mean(y_year_pred_2015))
print ("Standard deviation 2015")
print (np.sqrt(np.var(y_year_2015)),np.sqrt(np.var(y_year_pred_2015)))
plot_graphs(y_year_2005,y_year_pred_2005,"Year-2005")
plot_graphs(y_year_2010,y_year_pred_2010,"Year-2010")
plot_graphs(y_year_2015,y_year_pred_2015,"Year-2015")
MEAN 2005 121.2111111111111 106.49798150231581 Standard deviation 2005 123.77066107608005 76.08558540019236 MEAN 2010 139.93333333333334 112.18662987131034 Standard deviation 2010 135.71320250194282 84.35813629737333 MEAN 2015 88.52222222222223 96.76817006572782 Standard deviation 2015 86.62446123324875 52.45304841713268
In [93]:
model.fit(x=np.expand_dims(X_train, axis=2), y=y_train, batch_size=64, epochs=10, verbose=1, validation_split=0.1, shuffle=True)
y_pred = model.predict(np.expand_dims(X_test, axis=2))
print (mean_absolute_error(y_test, y_pred))
Epoch 1/10 15/15 [==============================] - 0s 12ms/step - loss: 6831.6484 - mae: 59.8372 - val_loss: 4876.1357 - val_mae: 51.4384 Epoch 2/10 15/15 [==============================] - 0s 7ms/step - loss: 6211.3706 - mae: 57.3244 - val_loss: 4587.8960 - val_mae: 51.0172 Epoch 3/10 15/15 [==============================] - 0s 8ms/step - loss: 6016.0649 - mae: 57.2547 - val_loss: 4491.1641 - val_mae: 51.1315 Epoch 4/10 15/15 [==============================] - 0s 7ms/step - loss: 5869.5327 - mae: 57.0192 - val_loss: 4383.1455 - val_mae: 50.2226 Epoch 5/10 15/15 [==============================] - 0s 8ms/step - loss: 5785.2510 - mae: 55.5861 - val_loss: 4274.9116 - val_mae: 49.2660 Epoch 6/10 15/15 [==============================] - 0s 7ms/step - loss: 5707.2461 - mae: 55.2401 - val_loss: 4240.0122 - val_mae: 49.0595 Epoch 7/10 15/15 [==============================] - 0s 7ms/step - loss: 5649.3164 - mae: 54.9786 - val_loss: 4199.9941 - val_mae: 48.6766 Epoch 8/10 15/15 [==============================] - 0s 8ms/step - loss: 5592.8267 - mae: 54.6077 - val_loss: 4156.4116 - val_mae: 48.6950 Epoch 9/10 15/15 [==============================] - 0s 9ms/step - loss: 5551.9995 - mae: 54.5321 - val_loss: 4145.3403 - val_mae: 48.3177 Epoch 10/10 15/15 [==============================] - 0s 7ms/step - loss: 5511.2412 - mae: 53.8805 - val_loss: 4138.8491 - val_mae: 47.8778 1/1 [==============================] - 0s 35ms/step 62.04201758341355
In [94]:
#2005
y_year_pred_2005 = reg.predict(X_year_2005)
#2010
y_year_pred_2010 = reg.predict(X_year_2010)
#2015
y_year_pred_2015 = reg.predict(X_year_2015)
print ("MEAN 2005")
print (np.mean(y_year_2005),np.mean(y_year_pred_2005))
print ("Standard deviation 2005")
print (np.sqrt(np.var(y_year_2005)),np.sqrt(np.var(y_year_pred_2005)))
print ("MEAN 2010")
print (np.mean(y_year_2010),np.mean(y_year_pred_2010))
print ("Standard deviation 2010")
print (np.sqrt(np.var(y_year_2010)),np.sqrt(np.var(y_year_pred_2010)))
print ("MEAN 2015")
print (np.mean(y_year_2015),np.mean(y_year_pred_2015))
print ("Standard deviation 2015")
print (np.sqrt(np.var(y_year_2015)),np.sqrt(np.var(y_year_pred_2015)))
plot_graphs(y_year_2005,y_year_pred_2005,"Year-2005")
plot_graphs(y_year_2010,y_year_pred_2010,"Year-2010")
plot_graphs(y_year_2015,y_year_pred_2015,"Year-2015")
MEAN 2005 121.2111111111111 106.49798150231581 Standard deviation 2005 123.77066107608005 76.08558540019236 MEAN 2010 139.93333333333334 112.18662987131034 Standard deviation 2010 135.71320250194282 84.35813629737333 MEAN 2015 88.52222222222223 96.76817006572782 Standard deviation 2015 86.62446123324875 52.45304841713268
Prediction Observations¶
Training on complete dataset¶
| Algorithm | MAE |
|---|---|
| Linear Regression | 94.94821727619338 |
| SVR | 127.74073860203839 |
| Artificial neural nets | 85.2648713528865 |
Training on telangana dataset¶
| Algorithm | MAE |
|---|---|
| Linear Regression | 70.61463829282977 |
| SVR | 90.30526775954294 |
| Artificial neural nets | 59.95190786532157 |
- Neural Networks performs better than SVR etc.
- Observed MAE is very high which indicates machine learning models won't work well for prediction of rainfall.
- Telangana data has a single pattern that can be learned by models, rather than learning different patterns of all states. so has high accuracy.
- Analysed individual year rainfall patterns for 2005, 2010, 2015.
- Approximately close means, noticed less standard deviations.
District wise details¶
- Similar to above the number of attributes is same, we don’t have year in this.
- The amount of rainfall in mm for each district is added from 1950-2000.
- We analyse the data individually for the state Andhra Pradesh
In [95]:
district = pd.read_csv(r"H:/4th Year/Sem 8/MaP2/rainfall-prediction-master/data/district_wise_rainfall_normal.csv",sep=",")
district = district.fillna(district.mean(numeric_only=True))
district.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 641 entries, 0 to 640 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 STATE_UT_NAME 641 non-null object 1 DISTRICT 641 non-null object 2 JAN 641 non-null float64 3 FEB 641 non-null float64 4 MAR 641 non-null float64 5 APR 641 non-null float64 6 MAY 641 non-null float64 7 JUN 641 non-null float64 8 JUL 641 non-null float64 9 AUG 641 non-null float64 10 SEP 641 non-null float64 11 OCT 641 non-null float64 12 NOV 641 non-null float64 13 DEC 641 non-null float64 14 ANNUAL 641 non-null float64 15 Jan-Feb 641 non-null float64 16 Mar-May 641 non-null float64 17 Jun-Sep 641 non-null float64 18 Oct-Dec 641 non-null float64 dtypes: float64(17), object(2) memory usage: 95.3+ KB
In [96]:
district.head()
Out[96]:
| STATE_UT_NAME | DISTRICT | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL | Jan-Feb | Mar-May | Jun-Sep | Oct-Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANDAMAN And NICOBAR ISLANDS | NICOBAR | 107.3 | 57.9 | 65.2 | 117.0 | 358.5 | 295.5 | 285.0 | 271.9 | 354.8 | 326.0 | 315.2 | 250.9 | 2805.2 | 165.2 | 540.7 | 1207.2 | 892.1 |
| 1 | ANDAMAN And NICOBAR ISLANDS | SOUTH ANDAMAN | 43.7 | 26.0 | 18.6 | 90.5 | 374.4 | 457.2 | 421.3 | 423.1 | 455.6 | 301.2 | 275.8 | 128.3 | 3015.7 | 69.7 | 483.5 | 1757.2 | 705.3 |
| 2 | ANDAMAN And NICOBAR ISLANDS | N & M ANDAMAN | 32.7 | 15.9 | 8.6 | 53.4 | 343.6 | 503.3 | 465.4 | 460.9 | 454.8 | 276.1 | 198.6 | 100.0 | 2913.3 | 48.6 | 405.6 | 1884.4 | 574.7 |
| 3 | ARUNACHAL PRADESH | LOHIT | 42.2 | 80.8 | 176.4 | 358.5 | 306.4 | 447.0 | 660.1 | 427.8 | 313.6 | 167.1 | 34.1 | 29.8 | 3043.8 | 123.0 | 841.3 | 1848.5 | 231.0 |
| 4 | ARUNACHAL PRADESH | EAST SIANG | 33.3 | 79.5 | 105.9 | 216.5 | 323.0 | 738.3 | 990.9 | 711.2 | 568.0 | 206.9 | 29.5 | 31.7 | 4034.7 | 112.8 | 645.4 | 3008.4 | 268.1 |
In [97]:
district[['DISTRICT', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].groupby("DISTRICT").mean()[:40].plot.barh(stacked=True,figsize=(13,8));
In [98]:
district[['DISTRICT', 'Jan-Feb', 'Mar-May',
'Jun-Sep', 'Oct-Dec']].groupby("DISTRICT").sum()[:40].plot.barh(stacked=True,figsize=(16,8));
Observations¶
- The above two graphs shows the distribution of over each district.
- As there are large number of districts only 40 were shown in the graphs.
Andhra Pradesh Data
In [99]:
ap_data = district[district['STATE_UT_NAME'] == 'ANDHRA PRADESH']
In [100]:
ap_data[['DISTRICT', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']].groupby("DISTRICT").mean()[:40].plot.barh(stacked=True,figsize=(18,8));
In [101]:
ap_data[['DISTRICT', 'Jan-Feb', 'Mar-May',
'Jun-Sep', 'Oct-Dec']].groupby("DISTRICT").sum()[:40].plot.barh(stacked=True,figsize=(16,8));
Observations¶
- The above two graphs shows the distribution of over each district in Andhra Pradesh.
- The above graphs show that more amount of rainfall is found in srikakulam district, least amount of rainfall is found in Anantapur district.
- It also shows that almost all states have more amount of rainfall have more amount of rainfall in the months june, july, september.
In [102]:
plt.figure(figsize=(11,4))
sns.heatmap(ap_data[['Jan-Feb','Mar-May','Jun-Sep','Oct-Dec','ANNUAL']].corr(),annot=True)
plt.show()
In [103]:
plt.figure(figsize=(11,4))
sns.heatmap(ap_data[['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC','ANNUAL']].corr(),annot=True)
plt.show()
Observations¶
- It is observed that in Andhra Pradesh, annual rainfall depends more in the months of january, febuary.
- It also shows that if there is rainfall in months march, april, may then there is less amount of rainfall in the months june, july, august, september.
Predictions¶
- We used the same types of models and evaluation metrics used for the above dataset.
- We also tested the amount of rainfall in hyderabad by models trained on complete dataset and andhra pradesh dataset.
In [104]:
# testing and training for the complete data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
division_data = np.asarray(district[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']])
X = None; y = None
for i in range(division_data.shape[1]-3):
if X is None:
X = division_data[:, i:i+3]
y = division_data[:, i+3]
else:
X = np.concatenate((X, division_data[:, i:i+3]), axis=0)
y = np.concatenate((y, division_data[:, i+3]), axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [105]:
temp = district[['DISTRICT','JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL','AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[district['STATE_UT_NAME'] == 'ANDHRA PRADESH']
hyd = np.asarray(temp[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL','AUG', 'SEP', 'OCT', 'NOV', 'DEC']].loc[temp['DISTRICT'] == 'HYDERABAD'])
# print temp
X_year = None; y_year = None
for i in range(hyd.shape[1]-3):
if X_year is None:
X_year = hyd[:, i:i+3]
y_year = hyd[:, i+3]
else:
X_year = np.concatenate((X_year, hyd[:, i:i+3]), axis=0)
y_year = np.concatenate((y_year, hyd[:, i+3]), axis=0)
In [106]:
from sklearn import linear_model
# linear model
reg = linear_model.ElasticNet(alpha=0.5)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
57.08862331011229
In [107]:
y_year_pred = reg.predict(X_year)
print ("MEAN Hyderabad")
print (np.mean(y_year),np.mean(y_year_pred))
print ("Standard deviation hyderabad")
print (np.sqrt(np.var(y_year)),np.sqrt(np.var(y_year_pred)))
plot_graphs(y_year,y_year_pred,"Prediction in Hyderabad")
MEAN Hyderabad 91.48888888888888 108.2025052233288 Standard deviation hyderabad 69.2514651982091 58.90326979488765
In [108]:
from sklearn.svm import SVR
# SVM model
clf = SVR(gamma='auto', C=0.1, epsilon=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
116.60671510825178
In [109]:
y_year_pred = clf.predict(X_year)
print ("MEAN Hyderabad")
print (np.mean(y_year),np.mean(y_year_pred))
print ("Standard deviation hyderabad")
print (np.sqrt(np.var(y_year)),np.sqrt(np.var(y_year_pred)))
plot_graphs(y_year,y_year_pred,"Prediction in Hyderabad")
MEAN Hyderabad 91.48888888888888 80.34903236716154 Standard deviation hyderabad 69.2514651982091 0.14736007434982146
In [110]:
model.fit(x=np.expand_dims(X_train, axis=2), y=y_train, batch_size=64, epochs=10, verbose=1, validation_split=0.1, shuffle=True)
y_pred = model.predict(np.expand_dims(X_test, axis=2))
print (mean_absolute_error(y_test, y_pred))
Epoch 1/10 65/65 [==============================] - 0s 6ms/step - loss: 7410.1948 - mae: 53.1640 - val_loss: 3912.4832 - val_mae: 41.4290 Epoch 2/10 65/65 [==============================] - 0s 6ms/step - loss: 5438.0537 - mae: 44.2064 - val_loss: 3657.5393 - val_mae: 37.6070 Epoch 3/10 65/65 [==============================] - 0s 6ms/step - loss: 5199.6011 - mae: 42.8950 - val_loss: 3673.1670 - val_mae: 37.0153 Epoch 4/10 65/65 [==============================] - 0s 6ms/step - loss: 5046.6323 - mae: 41.5777 - val_loss: 3512.6877 - val_mae: 36.8002 Epoch 5/10 65/65 [==============================] - 1s 10ms/step - loss: 5012.4092 - mae: 41.7169 - val_loss: 3471.7871 - val_mae: 36.1732 Epoch 6/10 65/65 [==============================] - 0s 6ms/step - loss: 4903.3105 - mae: 41.2504 - val_loss: 3802.2419 - val_mae: 36.2144 Epoch 7/10 65/65 [==============================] - 0s 6ms/step - loss: 4838.3657 - mae: 40.7725 - val_loss: 3813.9683 - val_mae: 37.3677 Epoch 8/10 65/65 [==============================] - 0s 6ms/step - loss: 4896.7153 - mae: 41.1777 - val_loss: 3616.3206 - val_mae: 36.5161 Epoch 9/10 65/65 [==============================] - 0s 6ms/step - loss: 4763.4482 - mae: 40.3536 - val_loss: 4103.4834 - val_mae: 38.9619 Epoch 10/10 65/65 [==============================] - 0s 6ms/step - loss: 4733.0625 - mae: 40.5106 - val_loss: 3609.8958 - val_mae: 37.2121 37/37 [==============================] - 0s 3ms/step 42.28625546922097
In [111]:
y_year_pred = model.predict(np.expand_dims(X_year, axis=2))
print ("MEAN Hyderabad")
print (np.mean(y_year),np.mean(y_year_pred))
print ("Standard deviation hyderabad")
print (np.sqrt(np.var(y_year)),np.sqrt(np.var(y_year_pred)))
# plot_graphs(y_year,y_year_pred,"Prediction in Hyderabad")
1/1 [==============================] - 0s 41ms/step MEAN Hyderabad 91.48888888888888 108.403656 Standard deviation hyderabad 69.2514651982091 74.833984
In [112]:
# training and testing sets for only andhra pradesh data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
division_data = np.asarray(ap_data[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL',
'AUG', 'SEP', 'OCT', 'NOV', 'DEC']])
X = None; y = None
for i in range(division_data.shape[1]-3):
if X is None:
X = division_data[:, i:i+3]
y = division_data[:, i+3]
else:
X = np.concatenate((X, division_data[:, i:i+3]), axis=0)
y = np.concatenate((y, division_data[:, i+3]), axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [113]:
from sklearn import linear_model
# linear model
reg = linear_model.ElasticNet(alpha=0.5)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
31.249748674622488
In [114]:
y_year_pred = reg.predict(X_year)
print ("MEAN Hyderabad")
print (np.mean(y_year),np.mean(y_year_pred))
print ("Standard deviation hyderabad")
print (np.sqrt(np.var(y_year)),np.sqrt(np.var(y_year_pred)))
plot_graphs(y_year,y_year_pred,"Prediction in Hyderabad")
MEAN Hyderabad 91.48888888888888 96.5489199306844 Standard deviation hyderabad 69.2514651982091 60.819355195446896
In [115]:
from sklearn.svm import SVR
# SVM model
clf = SVR(gamma='auto', C=0.1, epsilon=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print (mean_absolute_error(y_test, y_pred))
59.35057496896855
In [116]:
y_year_pred = clf.predict(X_year)
print ("MEAN Hyderabad")
print (np.mean(y_year),np.mean(y_year_pred))
print ("Standard deviation hyderabad")
print (np.sqrt(np.var(y_year)),np.sqrt(np.var(y_year_pred)))
plot_graphs(y_year,y_year_pred,"Prediction in Hyderabad")
MEAN Hyderabad 91.48888888888888 95.89978206795146 Standard deviation hyderabad 69.2514651982091 0.09247315036320868
In [117]:
model.fit(x=np.expand_dims(X_train, axis=2), y=y_train, batch_size=64, epochs=10, verbose=1, validation_split=0.1, shuffle=True)
y_pred = model.predict(np.expand_dims(X_test, axis=2))
print (mean_absolute_error(y_test, y_pred))
Epoch 1/10 3/3 [==============================] - 0s 51ms/step - loss: 1881.6719 - mae: 32.7400 - val_loss: 1128.7218 - val_mae: 23.7975 Epoch 2/10 3/3 [==============================] - 0s 25ms/step - loss: 1729.9153 - mae: 30.2609 - val_loss: 1073.9745 - val_mae: 24.3387 Epoch 3/10 3/3 [==============================] - 0s 37ms/step - loss: 1635.9270 - mae: 28.6857 - val_loss: 1071.7965 - val_mae: 25.3617 Epoch 4/10 3/3 [==============================] - 0s 34ms/step - loss: 1578.2692 - mae: 27.8114 - val_loss: 1044.4392 - val_mae: 25.4608 Epoch 5/10 3/3 [==============================] - 0s 30ms/step - loss: 1514.2677 - mae: 27.4533 - val_loss: 1018.0604 - val_mae: 25.2427 Epoch 6/10 3/3 [==============================] - 0s 29ms/step - loss: 1441.3817 - mae: 26.9870 - val_loss: 1004.6213 - val_mae: 24.9050 Epoch 7/10 3/3 [==============================] - 0s 29ms/step - loss: 1392.2823 - mae: 26.9143 - val_loss: 1011.1804 - val_mae: 24.7493 Epoch 8/10 3/3 [==============================] - 0s 28ms/step - loss: 1367.8698 - mae: 27.1685 - val_loss: 1023.4977 - val_mae: 24.7469 Epoch 9/10 3/3 [==============================] - 0s 29ms/step - loss: 1349.9089 - mae: 27.1171 - val_loss: 1032.5244 - val_mae: 24.6267 Epoch 10/10 3/3 [==============================] - 0s 28ms/step - loss: 1333.8815 - mae: 26.9949 - val_loss: 1036.2633 - val_mae: 24.1076 2/2 [==============================] - 0s 4ms/step 34.32628827776229
In [118]:
y_year_pred = model.predict(np.expand_dims(X_year, axis=2))
print ("MEAN Hyderabad")
print (np.mean(y_year),np.mean(y_year_pred))
print ("Standard deviation hyderabad")
print (np.sqrt(np.var(y_year)),np.sqrt(np.var(y_year_pred)))
# plot_graphs(y_train,y_year_pred,"Prediction in Hyderabad")
1/1 [==============================] - 0s 22ms/step MEAN Hyderabad 91.48888888888888 100.43834 Standard deviation hyderabad 69.2514651982091 63.309994
Prediction Observations¶
Training on complete dataset¶
| Algorithm | MAE |
|---|---|
| Linear Regression | 57.08862331011236 |
| SVR | 116.60671510825178 |
| Artificial neural nets | 44.329664907381066 |
Training on Hyderabad dataset¶
| Algorithm | MAE |
|---|---|
| Linear Regression | 31.249748674622477 |
| SVR | 59.35057496896855 |
| Artificial neural nets | 31.0601823988415 |
- Neural Networks performs better than SVR etc.
- Bad performance by SVR model.
- Andhra Pradesh data has a single pattern that can be learned by models, rather than learning different patterns of all states. so has high accuracy.
- Analysed individual year rainfall patterns for Hyderabad district.
- Approximately close means, noticed close standard deviations.
Conclusions¶
- Various visualizations of data are observed which helps in implementing the approaches for prediction.
- Prediction of amount of rainfall for both the types of dataset.
- Observations indicates machine learning models won't work well for prediction of rainfall due to fluctutaions in rainfall.